9 Data Structures in R (2.1)

9.1 Learning Outcomes

By the end of this tutorial, you should:

understand the five most commonly-used data structures in R
be able to create and manipulate these data structures
be familiar with the ‘tibbles’ data structure

9.2 Reading

Before this tutorial you should access and review the following papers:

R. Rein and D. Memmert, “Big data and tactical analysis in elite soccer: Future challenges and opportunities for sports science,” SpringerPlus, vol. 5, no. 1, p. 1410, Dec. 2016.[1]
L. Corain, R. Arboretti, R. Ceccato, F. Ronchi, and L. Salmaso, “Testing and ranking on round-robin design for data sport analytics with application to basketball,” Statistical Modelling, vol. 19, no. 1, pp. 5–27, 2019.[2]
F. Lord, D. B. Pyne, M. Welvaert, and J. K. Mara, “Methods of performance analysis in team invasion sports: A systematic review,” Journal of Sports Sciences, vol. 38, no. 20, pp. 2338–2349, Oct. 2020.[3]

There are direct links to these papers via the library reading list.

9.3 Introduction

In R, data structures are the fundamental ways in which data is organized and stored for use in our analysis and modeling.

R has various types of data structures, each optimised for different kinds of tasks. In this tutorial we’ll identify the five most common data structures that you’re likely to use within sport data analytics, and explore how to create and manipulate data within these structures.

9.4 Type One: Matrices

A matrix is a two-dimensional data structure in R.

It’s used to store and organize data in rows and columns, similar to a spreadsheet. It’s important to note that, just like vectors, all elements within a matrix must be of the same data type.

9.4.1 Creating matrices

We use the ‘matrix()’ function to create a matrix by supplying the data, the number of rows, and the number of columns. For example:

data <- c(1, 2, 3, 4, 5, 6) # create data
matrix_1 <- matrix(data, nrow = 2, ncol = 3) 
matrix_2 <- matrix(data, nrow = 3, ncol = 2)

# print these to console window
print(matrix_1)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

print(matrix_2)

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
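By default, matrix() fills the matrix column by column. As a minimal sketch, the base R arguments byrow and dimnames let you fill it row by row and label the rows and columns; the player names and values here are made up for illustration.

```r
# Hypothetical example data: goals and assists for three players
stats <- c(5, 4, 10, 6, 3, 2)

# byrow = TRUE fills the matrix one row at a time instead of column by column
stats_matrix <- matrix(
  stats,
  nrow = 3,
  ncol = 2,
  byrow = TRUE,
  dimnames = list(c("John", "Mike", "Lucas"), c("Goals", "Assists"))
)

print(stats_matrix)
print(dim(stats_matrix)) # number of rows, then number of columns
```

Naming the rows and columns means you can later refer to a cell by label, for example stats_matrix["Mike", "Goals"], rather than by position.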

9.4.2 Accessing matrix elements

We use square brackets ‘[ ]’ with row and column indices to access elements in a matrix.

Important

Remember: R uses 1-based indexing!

first_row_second_column <- matrix_1[1, 2]
entire_second_row <- matrix_1[2, ]
entire_third_column <- matrix_1[, 3]

print(first_row_second_column)

[1] 3

print(entire_second_row)

[1] 2 4 6

print(entire_third_column)

[1] 5 6
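Two further indexing behaviours are worth knowing. The sketch below, which uses its own small matrix called m (an illustrative name), shows how drop = FALSE keeps a single row as a matrix, and how a logical condition can be used as an index.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

# Selecting a single row normally returns a plain vector...
row_as_vector <- m[2, ]

# ...but drop = FALSE keeps the result as a 1 x 3 matrix
row_as_matrix <- m[2, , drop = FALSE]

# A logical condition selects every element that satisfies it
large_values <- m[m > 3]

print(row_as_vector)
print(dim(row_as_matrix))
print(large_values)
```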

9.4.3 Modifying matrices

You can add or update elements by assigning values using row and column indices:

matrix_1[1, 1] <- 42
matrix_2[, 2] <- c(7, 8, 9)

print(matrix_1)

     [,1] [,2] [,3]
[1,]   42    3    5
[2,]    2    4    6

print(matrix_2)

     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9

9.4.4 Matrix operations

You can also perform arithmetic and logical operations on matrices, such as element-wise addition, subtraction, multiplication, and division:

A <- matrix(c(1, 2, 3, 4), nrow = 2) # create our first matrix
B <- matrix(c(5, 6, 7, 8), nrow = 2) # create our second matrix
sum_matrix <- A + B
product_matrix <- A * B

print(sum_matrix)

     [,1] [,2]
[1,]    6   10
[2,]    8   12

print(product_matrix)

     [,1] [,2]
[1,]    5   21
[2,]   12   32

9.4.5 Matrix functions

We can apply functions to matrices to perform various operations, such as calculating the transposed matrix, row and column sums, and more:

transpose_matrix <- t(A)
row_sums <- rowSums(A)
col_sums <- colSums(A)

print(transpose_matrix)

     [,1] [,2]
[1,]    1    2
[2,]    3    4
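The row and column summaries can be inspected in the same way. This short sketch recreates the matrix A so that it runs on its own, and also shows the base R functions colMeans() and apply() as two of the "and more" options.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)

row_sums <- rowSums(A)   # one total per row
col_sums <- colSums(A)   # one total per column
col_means <- colMeans(A) # one mean per column

# apply() runs any function over rows (MARGIN = 1) or columns (MARGIN = 2)
row_max <- apply(A, 1, max)

print(row_sums)
print(col_sums)
print(col_means)
print(row_max)
```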

9.4.6 Matrix multiplication

Use the ‘%*%’ operator to perform matrix multiplication (the ‘*’ operator multiplies element-wise):

multiplied_matrix <- A %*% t(B)

print(multiplied_matrix)

     [,1] [,2]
[1,]   26   30
[2,]   38   44
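In base R, * multiplies matching elements, while %*% computes the matrix product. The sketch below puts the two side by side on the same pair of 2 x 2 matrices so that the difference is easy to see.

```r
A <- matrix(c(1, 2, 3, 4), nrow = 2)
B <- matrix(c(5, 6, 7, 8), nrow = 2)

element_wise <- A * B     # multiplies elements in matching positions
matrix_product <- A %*% B # combines rows of A with columns of B

print(element_wise)
print(matrix_product)

# Multiplying by the identity matrix leaves A unchanged
identity_2 <- diag(2)
print(A %*% identity_2)
```

For %*% to work, the number of columns in the first matrix must equal the number of rows in the second.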

9.5 Type Two: Arrays

Like a matrix, an ‘array’ can also be used to store data in a structured manner.

While a matrix is a two-dimensional structure (with rows and columns), an array is multi-dimensional.

It is unlikely that you will need to deal with arrays, but it’s worth knowing that they are there if you need them!

There’s a good introduction to arrays in R here: https://www.tutorialspoint.com/r/r_arrays.htm
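As a brief illustration, here is a sketch of a three-dimensional array built with the base R array() function. The interpretation of the dimensions (players, matches, seasons) is invented for the example.

```r
# Hypothetical layout: 2 players x 3 matches x 2 seasons
goals <- array(1:12, dim = c(2, 3, 2))

print(dim(goals))

# Index with one position per dimension: player 1, match 2, season 2
single_value <- goals[1, 2, 2]
print(single_value)

# apply() over the third dimension gives one total per season
season_totals <- apply(goals, 3, sum)
print(season_totals)
```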

9.6 Type Three: Lists

Lists are used to store and organize a collection of elements. Unlike vectors and matrices, lists can store elements of different data types and structures, such as numbers, characters, vectors, matrices, data frames, and even other lists.

9.6.1 Creating a list

You can use the list() function to create a list by combining elements:

simple_list <- list(42, "celtic", TRUE)
nested_list <- list(number = 42, text = "hello", vector = c(1, 2, 3), matrix = matrix(1:4, nrow = 2))

9.6.2 Accessing list elements

You can use double square brackets [[ ]] with an index or a name, or the dollar sign ($) with a name, to access elements in a list:

first_element <- simple_list[[1]]
named_element <- nested_list$text
third_element <- nested_list$vector
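Single and double square brackets behave differently on lists, which is a common source of confusion. In this sketch (the squad list is made up for illustration), single brackets return a smaller list, while double brackets return the element itself.

```r
squad <- list(club = "celtic", goals = c(2, 0, 3))

single_bracket <- squad["goals"]   # a list of length 1
double_bracket <- squad[["goals"]] # the numeric vector itself

print(class(single_bracket))
print(class(double_bracket))

# Calculations need the element itself, so use [[ ]] or $
total_goals <- sum(squad[["goals"]])
print(total_goals)
```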

9.6.3 Modifying lists

You can add, update, or remove elements by assigning values using indexing or names:

simple_list[[2]] <- "banana"
nested_list$new_element <- "Morton are great!"
nested_list$number <- NULL # removes the 'number' element

9.6.4 List operations

You can also perform operations on elements within a list using indexing or names to access them:

sum_vector <- nested_list$vector + c(4, 5, 6)
new_matrix <- nested_list$matrix * 2

9.6.5 List functions

You can apply functions to lists to perform different operations, such as calculating the length of the list or extracting specific elements from it:

list_length <- length(simple_list) # returns the list length
first_two_elements <- simple_list[1:2]  # returns the first two elements of the list
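You will often want to apply the same function to every element of a list. A minimal sketch using the base R functions lapply() and sapply(), with invented match data:

```r
match_goals <- list(home = c(2, 1, 3), away = c(0, 1, 1))

print(names(match_goals)) # names of the list elements

totals_list <- lapply(match_goals, sum)   # returns a list
totals_vector <- sapply(match_goals, sum) # simplifies to a named vector

print(totals_list)
print(totals_vector)
```

lapply() always returns a list, whereas sapply() tries to simplify the result, here to a named numeric vector.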

9.6.6 Converting lists

You can convert a list to other data structures using functions such as ‘unlist()’, ‘as.data.frame()’, or ‘as.matrix()’, as long as the list’s structure permits it:

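vector_from_list <- unlist(simple_list) # all elements are coerced to one type (here, character)
dataframe_from_list <- as.data.frame(nested_list[c("text", "vector")]) # use only elements that can form columns of equal length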
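What "as long as the list's structure permits it" means in practice can be sketched as follows. The lists below are made up for the example: a mixed-type list is flattened to a single type, and a list converts cleanly to a data frame when its elements have the same length.

```r
mixed_list <- list(42, "celtic", TRUE)

# unlist() must produce one type, so everything becomes character
flattened <- unlist(mixed_list)
print(flattened)
print(class(flattened))

# Elements of equal length become the columns of a data frame
equal_length_list <- list(Name = c("John", "Mike"), Goals = c(5, 10))
players_df <- as.data.frame(equal_length_list)
print(players_df)
```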

9.7 Type Four: Data Frames

Data frames are a core data structure in R, and are used to store and organize data in a tabular format with rows and columns.

Before the introduction of tibbles (see below), they were the most common data structure encountered while using R.

Data frames are similar to matrices, but can store columns of different data types, making them ideal for handling datasets with mixed data types. They closely resemble the way that data is stored in a spreadsheet application such as Excel, where you can have different types of data in different columns within your worksheet.

9.7.1 Creating data frames

You use the ‘data.frame()’ function to create a data frame by combining vectors or other data structures as columns:

names <- c("Scotland", "England", "Wales")
ages <- c(25, 30, 22)
heights <- c(165, 180, 172)
data <- data.frame(Name = names, Age = ages, Height = heights) # this creates a dataframe called 'data'
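Once a data frame exists, it is usually worth checking its structure. This sketch uses made-up player data (the players name and its values are illustrative) and assumes R 4.0 or later, where text columns stay as character rather than being converted to factors.

```r
players <- data.frame(
  Name = c("John", "Mike", "Lucas"),
  Age = c(25, 30, 22),
  Height = c(165, 180, 172)
)

str(players)            # compact summary: dimensions, column names and types
print(head(players, 2)) # the first two rows

# The class of every column, returned as a named vector
column_types <- sapply(players, class)
print(column_types)
```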

9.7.2 Accessing elements in a data frame

You can use square brackets [ ], double square brackets [[ ]], or the dollar sign with row and column indices or names to access elements, rows, or columns in your data frame.

For example:

first_row <- data[1, ]
age_column <- data$Age # note how we refer to a specific vector (variable) within the dataframe
third_row_second_column <- data[3, "Age"]

9.7.3 Modifying data frames

You can add, update, or remove elements, rows, or columns by assigning values using indexing or names.

data$Name[1] <- "Alicia"     # change an element
data$Weight <- c(60, 85, 75) # add a new column
data[4, ] <- list("David", 23, 185, 80) # add a new row; list() keeps each value's type, whereas c() would turn them all into text
data$Weight <- NULL # remove the 'Weight' column
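When adding rows, it helps to remember that c() can only hold one data type. The sketch below, using an invented players data frame, shows the coercion and one way to avoid it: binding a one-row data frame with rbind().

```r
players <- data.frame(Name = c("John", "Mike"), Age = c(25, 30))

# c() forces every value to a single type, so text plus numbers gives text
coerced_row <- c("David", 23)
print(class(coerced_row))

# Binding a one-row data frame keeps each column's original type
players <- rbind(players, data.frame(Name = "David", Age = 23))
print(players)
print(class(players$Age))
```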

9.7.4 Data frame operations

You can also perform operations on elements, rows, or columns within a data frame using indexing or names to access them:

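data$Age <- as.numeric(data$Age) # make sure Age is stored as numbers before doing calculations
data$Height <- as.numeric(data$Height) # likewise for Height, so the comparison below is numeric
avg_age <- mean(data$Age) # we can then do some calculations on it
tall_people <- data[data$Height > 175, ]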

9.7.5 Data frame functions

You can apply functions to data frames to perform various operations, such as calculating the dimensions, extracting specific elements, and more:

num_rows <- nrow(data)
num_columns <- ncol(data)
column_names <- colnames(data)
row_names <- rownames(data)

9.7.6 Subsetting (filtering) data frames

You can use logical conditions, column indices, or column names to filter or subset data frames:

adults <- data[data$Age >= 18, ]
name_age <- data[, c("Name", "Age")]

You can also use the ‘subset()’ function to remove a variable from a data frame:

data_02 <- subset(data, select = -Height) # creates a new data frame without the 'Height' variable
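Conditions can be combined with & (and) or | (or). As a sketch with made-up data, the example below filters on two columns at once, first with square brackets and then with subset(), which can filter rows and choose columns in a single call.

```r
players <- data.frame(
  Name = c("John", "Mike", "Lucas"),
  Age = c(25, 30, 22),
  Height = c(165, 180, 172)
)

# Rows where both conditions are TRUE
tall_and_young <- players[players$Height > 170 & players$Age < 25, ]
print(tall_and_young)

# subset() filters rows and selects columns together
same_rows <- subset(players, Height > 170 & Age < 25, select = c(Name, Height))
print(same_rows)
```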

9.8 Type Five: Tibbles

‘Tibbles’ are a recent introduction to R, as part of the tidyverse package. They are intended to make data manipulation more straightforward, and you will increasingly see them being used in preference to the older ‘data frame’ structure.

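Tibbles offer several improvements over data frames, such as clearer printing in the console (including the type of each column), support for column names that contain spaces or special characters, and more predictable handling of data types: a tibble never silently changes the type of your input, for example by converting text to factors.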

Tibbles are an integral part of the ‘tidyverse’ package and work well with other tidyverse functions and packages.

As with all additional packages, you need to install and load the tidyverse package before you can use tibbles:

install.packages("tidyverse")  # only use if you've not already got tidyverse installed
library(tidyverse)

9.8.1 Creating tibbles

You can use the tibble() function to create a tibble, by combining vectors or other data structures as columns:

names <- c("Alice", "Bob", "Charlie")
ages <- c(25, 30, 22)
heights <- c(165, 180, 172)
tb <- tibble(Name = names, Age = ages, Height = heights)

9.8.2 Converting data frames to tibbles

Use the ‘as_tibble()’ function to convert an existing data frame to a tibble:

df <- data.frame(Name = names, Age = ages, Height = heights)
tb <- as_tibble(df)

Why would you want to use tibbles rather than data frames? Well, there are a few reasons:

Tibbles have a refined print method that shows only the first 10 rows and all the columns that fit on screen, making them much easier to work with for large datasets.
Unlike data frames, tibbles do not simplify the results of subsetting operations into the lowest possible dimension; they always return another tibble. This means you won’t unexpectedly get a vector when you thought you were working with a data frame.
Tibbles allow column names that don’t meet R’s variable naming rules, like those that don’t start with a letter, or those that include spaces. This can be useful when working with datasets that have unusual column names.
Tibbles are more “lazy” and “surly” than data frames: they do less (for example, they never change variable names or types, and never partially match column names) and they complain more (for example, when you refer to a column that does not exist). This strictness exposes problems early and helps prevent some common data cleaning and manipulation errors.
If a single column is selected from a data frame using square brackets (for example, df[, "Age"]), the data frame returns a vector. Tibbles, however, always return a tibble, which provides more consistent behaviour.

You will find that tibbles can provide more robust, predictable, and user-friendly behaviour than traditional data frames, particularly when dealing with large or complex datasets.
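Some of these differences can be seen directly. The following sketch assumes the tibble package is installed and R 4.0 or later; the data and object names are made up for illustration.

```r
library(tibble) # assumes the tibble package (part of the tidyverse) is installed

df <- data.frame(Name = c("Alice", "Bob"), Age = c(25, 30))
tb <- as_tibble(df)

# Selecting one column with [ , ] simplifies a data frame but not a tibble
df_column <- df[, "Age"] # a numeric vector
tb_column <- tb[, "Age"] # still a tibble, with one column

# Data frames partially match column names with $; tibbles refuse to guess
df_partial <- df$Na                   # matches 'Name'
tb_partial <- suppressWarnings(tb$Na) # NULL (tibble also warns about the unknown column)

# Column names with spaces are kept by tibble() but altered by data.frame()
df_names <- names(data.frame(`Goals Scored` = c(5, 10)))
tb_names <- names(tibble(`Goals Scored` = c(5, 10)))

print(class(df_column))
print(class(tb_column))
print(df_names)
print(tb_names)
```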

9.8.3 Accessing tibble elements

Similar to data frames, use square brackets [], double square brackets [[]], or the dollar sign with row and column indices or names to access elements, rows, or columns in a tibble:

first_row <- tb[1, ]
age_column <- tb$Age
third_row_second_column <- tb[3, "Age"]
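Because single square brackets always return a tibble, selecting one cell gives a 1 x 1 tibble rather than a bare value. A minimal sketch, assuming the tibble package is installed and using made-up data, showing how to get the value itself:

```r
library(tibble) # assumes the tibble package is installed

tb <- tibble(Name = c("Alice", "Bob", "Charlie"), Age = c(25, 30, 22))

cell_as_tibble <- tb[3, "Age"] # a 1 x 1 tibble
cell_as_value <- tb$Age[3]     # the number itself
also_value <- tb[["Age"]][3]   # the same value, using [[ ]]

print(dim(cell_as_tibble))
print(cell_as_value)
```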

9.8.4 Modifying tibbles

Add, update, or remove elements, rows, or columns by assigning values using indexing or names:

tb$Name[1] <- "Alicia"
tb$Weight <- c(60, 85, 75) # Add a new column
tb <- add_row(tb, Name = "David", Age = 23, Height = 185, Weight = 80) # Add a new row
tb$Weight <- NULL # Remove the 'Weight' column

9.8.5 Tibble operations

Perform operations on elements, rows, or columns within a tibble using indexing or names to access them:

avg_age <- mean(tb$Age)
tall_people <- tb[tb$Height > 175, ]

9.8.6 Tibble functions

Apply functions to tibbles to perform various operations such as calculating the dimensions, extracting specific elements, and more:

num_rows <- nrow(tb)
num_columns <- ncol(tb)
column_names <- colnames(tb)
row_names <- rownames(tb)

9.8.7 Subsetting tibbles

Use logical conditions, column indices, or column names to filter or subset tibbles:

adults <- tb[tb$Age >= 18, ]
name_age <- tb[, c("Name", "Age")]
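The same subsetting can be written with dplyr verbs, which many people find easier to read. This is a sketch rather than the only way to do it; it assumes the dplyr and tibble packages are installed and rebuilds a small tibble so that it runs on its own.

```r
library(dplyr)  # assumes dplyr is installed (it is part of the tidyverse)
library(tibble)

tb <- tibble(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  Height = c(165, 180, 172)
)

adults <- tb %>% filter(Age >= 18)     # keep rows that meet a condition
name_age <- tb %>% select(Name, Age)   # keep chosen columns

# pull() extracts a single column as a vector
tall_names <- tb %>% filter(Height > 175) %>% pull(Name)

print(adults)
print(name_age)
print(tall_names)
```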

This introduction to tibbles has only scratched the surface of this data structure. Hadley Wickham has provided excellent and comprehensive coverage here.

9.9 Activity: Creating and manipulating various data structures

The following activity allows you to practise some of the techniques covered above. You may also have to do some research to find out how to complete some of the challenges.

Install and load necessary packages (e.g. tidyverse)
Create a dataframe of a hypothetical football team’s player statistics. Include PlayerName, GoalsScored, and Assists.
Print the dataframe to the console.
Convert the above dataframe to a tibble and print it.
Add a new player’s statistics to the dataframe and a new column for “GamesPlayed.”
Using dplyr, filter the data to keep only players who’ve scored more than 5 goals.
Calculate a new column “GoalPerGame” and get the average goals scored by the team.
Arrange players by goals scored in descending order.
Create a new column called ‘Position’. For each player, provide a value that represents the position they play.
Finally, group players by position and get the total goals scored for each position.

9.10 Solutions

Install and load necessary packages.

    library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Create a dataframe of a hypothetical football team’s player statistics.

    football_df <- data.frame(
      PlayerName = c("John", "Mike", "Lucas", "Eva"),
      GoalsScored = c(5, 10, 3, 6),
      Assists = c(4, 6, 2, 3)
    )
    print(football_df)

  PlayerName GoalsScored Assists
1       John           5       4
2       Mike          10       6
3      Lucas           3       2
4        Eva           6       3

Convert the above dataframe to a tibble and print it.

    football_tibble <- as_tibble(football_df)
    print(football_tibble)

# A tibble: 4 × 3
  PlayerName GoalsScored Assists
  <chr>            <dbl>   <dbl>
1 John                 5       4
2 Mike                10       6
3 Lucas                3       2
4 Eva                  6       3

Add a new player’s statistics to the dataframe and a new column for “GamesPlayed.”

football_df <- rbind(football_df, data.frame(PlayerName = "Sophia", GoalsScored = 7, Assists = 5))
football_df$GamesPlayed <- c(10, 12, 9, 11, 10)
print(football_df)

  PlayerName GoalsScored Assists GamesPlayed
1       John           5       4          10
2       Mike          10       6          12
3      Lucas           3       2           9
4        Eva           6       3          11
5     Sophia           7       5          10

Using dplyr, filter the data to keep only players who’ve scored more than 5 goals.

top_scorers <- football_df %>% filter(GoalsScored > 5)
print(top_scorers)

  PlayerName GoalsScored Assists GamesPlayed
1       Mike          10       6          12
2        Eva           6       3          11
3     Sophia           7       5          10

Calculate a new column “GoalPerGame” and get the average goals scored by the team.

football_df <- football_df %>%
  mutate(GoalPerGame = GoalsScored / GamesPlayed)
avg_goals <- football_df %>%
  summarise(mean_goals = mean(GoalsScored))
print(football_df)

  PlayerName GoalsScored Assists GamesPlayed GoalPerGame
1       John           5       4          10   0.5000000
2       Mike          10       6          12   0.8333333
3      Lucas           3       2           9   0.3333333
4        Eva           6       3          11   0.5454545
5     Sophia           7       5          10   0.7000000

print(avg_goals)

  mean_goals
1        6.2

Arrange players by goals scored in descending order.

sorted_df <- football_df %>%
  arrange(desc(GoalsScored))
print(sorted_df)

  PlayerName GoalsScored Assists GamesPlayed GoalPerGame
1       Mike          10       6          12   0.8333333
2     Sophia           7       5          10   0.7000000
3        Eva           6       3          11   0.5454545
4       John           5       4          10   0.5000000
5      Lucas           3       2           9   0.3333333

Group players by position and get the total goals scored for each position.

football_df$Position <- c("Forward", "Midfielder", "Defender", "Forward", "Midfielder")
position_goals <- football_df %>%
  group_by(Position) %>%
  summarise(TotalGoals = sum(GoalsScored))
print(position_goals)

# A tibble: 3 × 2
  Position   TotalGoals
  <chr>           <dbl>
1 Defender            3
2 Forward            11
3 Midfielder         17